You've been given a dataset of about 500 malware binary files that have
been found on your organization's computers. Whenever you find more malware,
you want to be able to tell if you've seen a file like this before.

Binary files are hard to understand. When code is written, there are several
more steps before it becomes software. Some parts of this process are:
i.  Compiling, which turns human-readable source code into assembly code.
    Assembly code is difficult for humans to read, but it closely mimics the most
    basic raw instructions that a computer needs in order to run a program.
ii. Assembling, which turns assembly code into machine code. Machine code is
    impossible for humans to read, but this representation is what a computer
    actually needs to execute.

The malware binary files that were given to you to analyze are all in machine
code, but luckily, you were able to run a program called a disassembler to
turn them back into assembly code.

Assembly code contains *instructions* which tell a computer how to update
its own internal memory, and its progress through reading the assembly code
itself. For instance, the `jmp` instruction means "jump to executing a
different instruction", and the `add` instruction means "add two numbers and
store the result in memory".

Your dataset contains data about all the malware files, including their
file hash, which serves as a name, and the counts of all of the `jmp` and `add`
instructions.

Malware attackers often release many slightly different versions of the same
malware over time. These different versions always have totally different
hashes, but they are likely to have similar numbers of `jmp` and `add`
instructions.